The project is aimed at doing exploratory data analysis on Prosper Loan Data . This is a financial data related to loan browers , lenders , interest rate many more. Prosper.com is one of the leading peer to peer lending sites on the web today. ===============================================================================
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
There are around 81 variables. Let us ask some question and explore the dataset for better understanding.
From the graph above the number of loans borrowed has incresed from 2005 - 2008. Then there is a drop in the number of loans offered. Again the loan number has been increasing from 2010 - 2013. But again there is a dip in 2014.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
ggplot(pd, aes(LoanOriginalAmount)) +
geom_density(fill = "#45A3E6")+
geom_vline(xintercept = 8337, size = 1, colour = "#FF3721",
linetype = "dashed")+
ggtitle("Loan Amount Distribution")
The loan amount is highly skewed to the left, which means to say that most of the loan is for small amount. The maximun loan amount borrowed is 35000 and minimum is 1000. The mean loan amount is 8337.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
Most of the loans are current. There are also some loans which are defaulted and chargedoff.
Prosper seems to offer loans for 12 , 36 or 60 months term.The most common being 36 months loan term.
The most common purpose to borrow loan from Prosper is for Debt Consolidation followed by other.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.6 739.0 899.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
The distribution of Credit score upper and lower range look almost normally distributed. They do have some outliers and NA(non available) data. Credit score is a inflencing factor for getting a better interest rate. Instead of 2 separate variables we can combine them into one and call it CreditScore.
summary(pd$CreditScore)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 591
The Credit Score follows the same pattern as CreditScore Upper nad Lower Range of normal distribution. The mean Creditscore is 695.1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The distribution of Borrower APR is almost normal with a mean rate of 0.21883 The distribution of Borrower Interest Rate is almost normal with a mean rate of 0.1928.
Prosper score is a custom risk score built using historical Prosper data to assess the risk of Prosper borrower listings. The output to Prosper users is a Prosper score which ranges from 1 to 11, with 11 being the best, or lowest risk, score. The worst, or highest risk, score, is a 1. In this dataset prosper Score looks normally distributed. With a mean score of 5.95 .Also to be noted there are a number of rows with data NA(not available) this is because as stated in the variable definition sheet the risk score is applicable only from 2009 onwards.
The Prosper Rating is proprietary system that allows prosper to maintain consistency when evaluating each loan application. Prosper Ratings allow investors to consider a loan’s level of risk because the rating represents an estimated average annualized estimated loss rate range.
About 67322 borrowers are employed. But i think part-time,full-time should belong to emplyoed category. It is interesting to know that even “Not employed” people get loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The debt to income ration is highly skewed to the left. The mean debttoincome ratio is 0.276.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
From the plot we can observe the income is skewed to the left. Very few borrowers earn more than 10000. The mean monthly income is 5608.
## 2005 Q4 2006 Q1 2006 Q2 2006 Q3 2006 Q4 2007 Q1 2007 Q2 2007 Q3 2007 Q4
## 22 315 1254 1934 2403 3079 3118 2671 2592
## 2008 Q1 2008 Q2 2008 Q3 2008 Q4 2009 Q2 2009 Q3 2009 Q4 2010 Q1 2010 Q2
## 3074 4344 3602 532 13 585 1449 1243 1539
## 2010 Q3 2010 Q4 2011 Q1 2011 Q2 2011 Q3 2011 Q4 2012 Q1 2012 Q2 2012 Q3
## 1270 1600 1744 2478 3093 3913 4435 5061 5632
## 2012 Q4 2013 Q1 2013 Q2 2013 Q3 2013 Q4 2014 Q1
## 4425 3616 7099 9180 14450 12172
Looks like proser laon data is from last quarter of 2005 until first quarter of 2014. The number of loans offere by prosper has increase gradually after the correction perios of 2009. https://en.wikipedia.org/wiki/Prosper_Marketplace
The prosper loan data set consist of 113,937 loans and 81 variables. Most data is recorded between 2005 to 2014. Some of the variables include information about the loan such as the amount, rate, length, and status, and also information on the borrower, such as credit grade, income range, and employment status and so on.
The dataset has 81 variables , in this section i am interested to understand the variable , look at their distribution and try to find some insights. For univariate analysis i have focused on the follwoing variables. ListingCreationDate Term ListingCategory..numeric. ProsperScore CreditScoreRangeUpper CreditScoreRangeLower LoanStatus LoanOriginalAmount BorrowerAPR BorrowerRate ProsperScore ProsperRating EmploymentStatus
Based on my research about Prosper score i found that variables may also be interesting to anayse. CurrentDelinquencies StatedMonthlyIncome DebtToIncomeRatio https://www.prosper.com/plp/general-prosper_score/
Yes i created one variable called CreditScore by takin an average of CreditScoreRangeUpper and CreditScoreRangeLower. I also created variable called LoanOriginationQuarter using the zoo library.
When analysisng the distribution of no of loans vsr years , a dip was observed around 2008-2009.
The debt to income ratio was cut into bands to be able to analyse borrowers by debt to income ratio group.
## pd$ProsperRating..Alpha.:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 609.5 649.5 653.9 709.5 889.5 591
## --------------------------------------------------------
## pd$ProsperRating..Alpha.: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 649.5 709.5 729.5 739.4 769.5 889.5
## --------------------------------------------------------
## pd$ProsperRating..Alpha.: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 689.5 749.5 789.5 783.6 809.5 889.5
## --------------------------------------------------------
## pd$ProsperRating..Alpha.: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 689.5 709.5 716.4 749.5 869.5
## --------------------------------------------------------
## pd$ProsperRating..Alpha.: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 689.5 699.4 729.5 889.5
## --------------------------------------------------------
## pd$ProsperRating..Alpha.: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 689.5 689.8 709.5 869.5
## --------------------------------------------------------
## pd$ProsperRating..Alpha.: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 649.5 669.5 672.0 689.5 869.5
## --------------------------------------------------------
## pd$ProsperRating..Alpha.: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 689.5 686.5 709.5 869.5
There is a clear indication as the credit score increase the Prosper rating also increases. As the credit score increases the Prosper Rating falls under AA catergory which stands for lower risk.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 591
Borrowers APR and Credit Score have a negative corelation. The Borrower APR increases with increase in Credit Score.
It is interesting to know that the prosper score is high for higher loan amount.
The monthly income is a very important factor which inflence the prosper score. From the plot above we can see prosper score increase with larger stated monthly income.
The debt-to-income ratio is one way lenders, including mortgage lenders, measure an individual’s ability to manage monthly payment and repay debts. So it is necssary to have a low debt to income ratio for better rate and the graph also shows us the borrower rate better rate when the debtTo income ratio is low.
Some of the key observations are 1) Credit score increase the Prosper rating also increases. As the credit score increases the Prosper Rating falls under AA catergory which stands for lower risk. 2) Prosper score is high for larger loan amount.
Borrowers APR and Credit Score have a negative corelation. The Borrower APR increases with increase in Credit Score.
In above plot it can be clearly seen Not Employed and Retired APR has been steady and higher for prosper scores below six.
## $title
## [1] " BorrowerAPR VS Credit Score with colored Prosper Rating"
##
## $subtitle
## NULL
##
## attr(,"class")
## [1] "labels"
The loans high prosper score and credit score had a lower Borrower Rate. From the above plot we can see the Prosper score of 7.5 and above and a credit score of 700 and above had a low borrower rate.
The multivarite plots futher prove that prosper score and credit rate have a very high inflence on the borrowers rate.
It is observed that empllyoment status had an influence on APR and credit score. Not Employed and Retired borrowers APR has been higher for prosper scores below six. It is interesting to note that self emplyoed managed to get better APR and prosper score.
In this plot it can be observed ProsperRating Increases with increase in the credit score.
The plot shows emplyoed borrower get higher prosperscore and also better APR . It can also be seen self-emplyoed borrowes also seem to get almost the same kind of prosper rating and APR value. A combination of other vaiables help to decide on the prosper score and APR.
The plot above shows the distribution of loans over a time period of 2005- 2014.Most importantly it the trend the loans follow before and after 2009 Q2. We can observe for a given year the loan amount increase from first quarter to last quarter.
This data set was a challenge mainly because of the number of variables. It gave a good experience on how real wrold data is and since my knowledge on loans was limited i had to spend some time understanding the terms and also make decisions on what kind of questions to ask to explore this Prosper loan data.
Working with ggplot2 to explore th variables was not easy. I had to iterate the code many times to get a satisfactory representations. The main area of difficulty is choose the correct scale so the graph can represent the data better. Another chanllenge was choosing the right graph type since the dataset has both categorical and quantitative values.
While testing the bivariate and multivarite plots some of the relationships was not as i expected for example - Borrowers APR and Credit Score has a negative corelation. The Borrower APR increases with increase in Credit Score. Atleast to my understanding i thought the borrower can get a lower APR with a good credit score rating.
I did manage to observe some interesting factors about indidvidual variables expecially i would like to mention the distribution of loan across the quarters gives a good overview of Prosper loan performance. This promted me to go look for the reason why there was a dip in loan sales during 2009. I also managed to find some key variable which influence the interest rate. I found the Prosper Score and Prosper Rating are key variables which affect the borrowers APR and interest rate to name some.
After doing a lot of reading i understood there are many interesting questions to be askes and explored. This is just a small analysis presented here for the project requirements, I wish to further continue and explore the dataset especially i want to see distribution of loan across different states, which state perform well in paying the laon on time. Also looking at what king of loans investors prefer to invest in and what a re the key factors whaich help them in this process may be a good area of investigation moving forward.